1.
Article in English | MEDLINE | ID: mdl-38663087

ABSTRACT

The Human Genome Project was an enormous accomplishment, providing a foundation for countless explorations into the genetics and genomics of the human species. Yet for many years, the human genome reference sequence remained incomplete and lacked representation of human genetic diversity. Recently, two major advances have emerged to address these shortcomings: complete gap-free human genome sequences, such as the one developed by the Telomere-to-Telomere Consortium, and high-quality pangenomes, such as the one developed by the Human Pangenome Reference Consortium. Facilitated by advances in long-read DNA sequencing and genome assembly algorithms, complete human genome sequences resolve regions that have been historically difficult to sequence, including centromeres, telomeres, and segmental duplications. In parallel, pangenomes capture the extensive genetic diversity across populations worldwide. Together, these advances usher in a new era of genomics research, enhancing the accuracy of genomic analysis, paving the path for precision medicine, and contributing to deeper insights into human biology.

2.
Genome Res ; 34(3): 454-468, 2024 Apr 25.
Article in English | MEDLINE | ID: mdl-38627094

ABSTRACT

Reference-free genome phasing is vital for understanding allele inheritance and the impact of single-molecule DNA variation on phenotypes. To achieve thorough phasing across homozygous or repetitive regions of the genome, long-read sequencing technologies are often used to perform phased de novo assembly. As a step toward reducing the cost and complexity of this type of analysis, we describe new methods for accurately phasing Oxford Nanopore Technologies (ONT) sequence data with the Shasta genome assembler and a modular tool for extending phasing to the chromosome scale called GFAse. We test using new variants of ONT PromethION sequencing, including those using proximity ligation, and show that newer, higher accuracy ONT reads substantially improve assembly quality.


Subject(s)
Nanopores , Humans , Sequence Analysis, DNA/methods , Nanopore Sequencing/methods , High-Throughput Nucleotide Sequencing/methods , Software , Genomics/methods
3.
Nat Biotechnol ; 42(4): 663-673, 2024 Apr.
Article in English | MEDLINE | ID: mdl-37165083

ABSTRACT

Pangenome references address biases of reference genomes by storing a representative set of diverse haplotypes and their alignment, usually as a graph. Alternate alleles determined by variant callers can be used to construct pangenome graphs, but advances in long-read sequencing are leading to widely available, high-quality phased assemblies. Constructing a pangenome graph directly from assemblies, as opposed to variant calls, leverages the graph's ability to represent variation at different scales. Here we present the Minigraph-Cactus pangenome pipeline, which creates pangenomes directly from whole-genome alignments, and demonstrate its ability to scale to 90 human haplotypes from the Human Pangenome Reference Consortium. The method builds graphs containing all forms of genetic variation while allowing use of current mapping and genotyping tools. We measure the effect of the quality and completeness of reference genomes used for analysis within the pangenomes and show that using the CHM13 reference from the Telomere-to-Telomere Consortium improves the accuracy of our methods. We also demonstrate construction of a Drosophila melanogaster pangenome.


Subject(s)
Drosophila melanogaster , High-Throughput Nucleotide Sequencing , Humans , Animals , Drosophila melanogaster/genetics , Haplotypes/genetics , High-Throughput Nucleotide Sequencing/methods , Alleles , Sequence Analysis, DNA , Genome, Human/genetics
4.
Nature ; 617(7960): 312-324, 2023 05.
Article in English | MEDLINE | ID: mdl-37165242

ABSTRACT

Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.


Subject(s)
Genome, Human , Genomics , Humans , Diploidy , Genome, Human/genetics , Haplotypes/genetics , Sequence Analysis, DNA , Genomics/standards , Reference Standards , Cohort Studies , Alleles , Genetic Variation
5.
Bioinformatics ; 39(2)2023 02 03.
Article in English | MEDLINE | ID: mdl-36749013

ABSTRACT

MOTIVATION: Pairwise sequence alignment remains a fundamental problem in computational biology and bioinformatics. Recent advances in genomics and sequencing technologies demand faster and scalable algorithms that can cope with the ever-increasing sequence lengths. Classical pairwise alignment algorithms based on dynamic programming are strongly limited by quadratic requirements in time and memory. The recently proposed wavefront alignment algorithm (WFA) introduced an efficient algorithm to perform exact gap-affine alignment in O(ns) time, where s is the optimal score and n is the sequence length. Notwithstanding these bounds, WFA's O(s^2) memory requirements become computationally impractical for genome-scale alignments, leading to a need for further improvement. RESULTS: In this article, we present the bidirectional WFA algorithm, the first gap-affine algorithm capable of computing optimal alignments in O(s) memory while retaining WFA's time complexity of O(ns). As a result, this work improves the lowest known memory bound O(n) to compute gap-affine alignments. In practice, our implementation never requires more than a few hundred MBs aligning noisy Oxford Nanopore Technologies reads up to 1 Mbp long while maintaining competitive execution times. AVAILABILITY AND IMPLEMENTATION: All code is publicly available at https://github.com/smarco/BiWFA-paper. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Genomics , Computational Biology , Genome , Sequence Analysis, DNA , Software
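
Note on entry 5: the quadratic cost of classical dynamic programming mentioned in the abstract can be illustrated with a minimal sketch of the textbook gap-affine recurrence (Gotoh-style). This is not WFA or BiWFA, and it is not taken from the linked repository. The function name and the penalty values (mismatch 4, gap open 6, gap extend 2) are assumptions chosen for illustration; a gap of length k is assumed to cost gap_open + k * gap_extend.

```python
def gap_affine_penalty(a, b, mismatch=4, gap_open=6, gap_extend=2):
    """Minimum gap-affine penalty between strings a and b (hypothetical helper).

    Fills three (len(a)+1) x (len(b)+1) tables, so time and memory are both
    quadratic in the sequence lengths. Matches cost 0.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # best[i][j]: minimum penalty for aligning a[:i] with b[:j]
    # ins[i][j]:  minimum penalty when the alignment ends in a gap that consumes b[j-1]
    # dele[i][j]: minimum penalty when the alignment ends in a gap that consumes a[i-1]
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    ins = [[inf] * (m + 1) for _ in range(n + 1)]
    dele = [[inf] * (m + 1) for _ in range(n + 1)]

    best[0][0] = 0
    for j in range(1, m + 1):
        ins[0][j] = gap_open + j * gap_extend
        best[0][j] = ins[0][j]
    for i in range(1, n + 1):
        dele[i][0] = gap_open + i * gap_extend
        best[i][0] = dele[i][0]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either open a new gap from the best state or extend an existing one.
            ins[i][j] = min(best[i][j - 1] + gap_open + gap_extend,
                            ins[i][j - 1] + gap_extend)
            dele[i][j] = min(best[i - 1][j] + gap_open + gap_extend,
                             dele[i - 1][j] + gap_extend)
            substitution = best[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            best[i][j] = min(substitution, ins[i][j], dele[i][j])
    return best[n][m]


if __name__ == "__main__":
    print(gap_affine_penalty("ACGT", "ACT"))
```

Every cell of the three tables is computed regardless of how similar the sequences are, which is the behaviour that score-bounded approaches such as the one described in the abstract are designed to avoid.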
6.
Nat Methods ; 20(2): 239-247, 2023 02.
Article in English | MEDLINE | ID: mdl-36646895

ABSTRACT

Pangenomics is emerging as a powerful computational paradigm in bioinformatics. This field uses population-level genome reference structures, typically consisting of a sequence graph, to mitigate reference bias and facilitate analyses that were challenging with previous reference-based methods. In this work, we extend these methods into transcriptomics to analyze sequencing data using the pantranscriptome: a population-level transcriptomic reference. Our toolchain, which consists of additions to the VG toolkit and a standalone tool, RPVG, can construct spliced pangenome graphs, map RNA sequencing data to these graphs, and perform haplotype-aware expression quantification of transcripts in a pantranscriptome. We show that this workflow improves accuracy over state-of-the-art RNA sequencing mapping methods, and that it can efficiently quantify haplotype-specific transcript expression without needing to characterize the haplotypes of a sample beforehand.


Subject(s)
Computational Biology , Gene Expression Profiling , Haplotypes , Metagenomics , Transcriptome
7.
bioRxiv ; 2023 Dec 15.
Article in English | MEDLINE | ID: mdl-38168361

ABSTRACT

Pangenomes, by including genetic diversity, should reduce reference bias by better representing new samples compared to them. Yet when comparing a new sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, causing false read mappings. These irrelevant variants are generally rarer in terms of allele frequency, and have previously been dealt with using allele frequency filters. However, this is a blunt heuristic that both fails to remove some irrelevant variants and removes many relevant variants. We propose a new approach, inspired by local ancestry inference methods, that imputes a personalized pangenome subgraph based on sampling local haplotypes according to k-mer counts in the reads. Our approach is tailored for the Giraffe short read aligner, as the indexes it needs for read mapping can be built quickly. We compare the accuracy of our approach to state-of-the-art methods using graphs from the Human Pangenome Reference Consortium. The resulting personalized pangenome pipelines provide faster pangenome read mapping than comparable pipelines that use a linear reference, reduce small variant genotyping errors by 4x relative to the Genome Analysis Toolkit (GATK) best-practice pipeline, and for the first time make short-read structural variant genotyping competitive with long-read discovery methods.

8.
Science ; 374(6574): abg8871, 2021 Dec 17.
Article in English | MEDLINE | ID: mdl-34914532

ABSTRACT

We introduce Giraffe, a pangenome short-read mapper that can efficiently map to a collection of haplotypes threaded through a sequence graph. Giraffe maps sequencing reads to thousands of human genomes at a speed comparable to that of standard methods mapping to a single reference genome. The increased mapping accuracy enables downstream improvements in genome-wide genotyping pipelines for both small variants and larger structural variants. We used Giraffe to genotype 167,000 structural variants, discovered in long-read studies, in 5202 diverse human genomes that were sequenced using short reads. We conclude that pangenomics facilitates a more comprehensive characterization of variation and, as a result, has the potential to improve many genomic analyses.


Subject(s)
Genetic Variation , Genome, Human , Genomics/methods , Genotyping Techniques , Algorithms , Alleles , Computational Biology , Genome, Fungal , Genotype , Haplotypes , High-Throughput Nucleotide Sequencing , Humans , Polymorphism, Single Nucleotide , Quantitative Trait Loci , Saccharomyces/genetics , Saccharomyces cerevisiae/genetics , Sequence Analysis, DNA
9.
Nat Methods ; 18(11): 1322-1332, 2021 11.
Article in English | MEDLINE | ID: mdl-34725481

ABSTRACT

Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data have demonstrated a long read length, but current interpretation methods for their novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single-nucleotide-variant identification method at the whole-genome scale and produces high-quality single-nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with superior performance over the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio HiFi-polished).


Subject(s)
Genes , Haplotypes , High-Throughput Nucleotide Sequencing/methods , Nanopores , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/methods , Software , Genome, Human , Humans , Molecular Sequence Annotation
10.
Bioinformatics ; 36(21): 5139-5144, 2021 01 29.
Article in English | MEDLINE | ID: mdl-33040146

ABSTRACT

MOTIVATION: Pangenomics is a growing field within computational genomics. Many pangenomic analyses use bidirected sequence graphs as their core data model. However, implementing and correctly using this data model can be difficult, and the scale of pangenomic datasets can be challenging to work at. These challenges have impeded progress in this field. RESULTS: Here, we present a stack of two C++ libraries, libbdsg and libhandlegraph, which use a simple, field-proven interface, designed to expose elementary features of these graphs while preventing common graph manipulation mistakes. The libraries also provide a Python binding. Using a diverse collection of pangenome graphs, we demonstrate that these tools allow for efficient construction and manipulation of large genome graphs with dense variation. For instance, the speed and memory usage are up to an order of magnitude better than the prior graph implementation in the VG toolkit, which has now transitioned to using libbdsg's implementations. AVAILABILITY AND IMPLEMENTATION: libhandlegraph and libbdsg are available under an MIT License from https://github.com/vgteam/libhandlegraph and https://github.com/vgteam/libbdsg.


Subject(s)
Libraries , Software , Genome , Genomics
11.
Annu Rev Genomics Hum Genet ; 21: 139-162, 2020 08 31.
Article in English | MEDLINE | ID: mdl-32453966

ABSTRACT

Low-cost whole-genome assembly has enabled the collection of haplotype-resolved pangenomes for numerous organisms. In turn, this technological change is encouraging the development of methods that can precisely address the sequence and variation described in large collections of related genomes. These approaches often use graphical models of the pangenome to support algorithms for sequence alignment, visualization, functional genomics, and association studies. The additional information provided to these methods by the pangenome allows them to achieve superior performance on a variety of bioinformatic tasks, including read alignment, variant calling, and genotyping. Pangenome graphs stand to become a ubiquitous tool in genomics. Although it is unclear whether they will replace linear reference genomes, their ability to harmoniously relate multiple sequence and coordinate systems will make them useful irrespective of which pangenomic models become most common in the future.


Subject(s)
Algorithms , Computational Biology/methods , Computer Graphics , Genome, Human , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA
12.
JCO Clin Cancer Inform ; 4: 160-170, 2020 02.
Article in English | MEDLINE | ID: mdl-32097024

ABSTRACT

PURPOSE: Many antineoplastics are designed to target upregulated genes, but quantifying upregulation in a single patient sample requires an appropriate set of samples for comparison. In cancer, the most natural comparison set is unaffected samples from the matching tissue, but there are often too few available unaffected samples to overcome high intersample variance. Moreover, some cancer samples have misidentified tissues of origin or even composite-tissue phenotypes. Even if an appropriate comparison set can be identified, most differential expression tools are not designed to accommodate comparisons to a single patient sample. METHODS: We propose a Bayesian statistical framework for gene expression outlier detection in single samples. Our method uses all available data to produce a consensus background distribution for each gene of interest without requiring the researcher to manually select a comparison set. The consensus distribution can then be used to quantify over- and underexpression. RESULTS: We demonstrate this method on both simulated and real gene expression data. We show that it can robustly quantify overexpression, even when the set of comparison samples lacks ideally matched tissue samples. Furthermore, our results show that the method can identify appropriate comparison sets from samples of mixed lineage and rediscover numerous known gene-cancer expression patterns. CONCLUSION: This exploratory method is suitable for identifying expression outliers from comparative RNA sequencing (RNA-seq) analysis for individual samples, and Treehouse, a pediatric precision medicine group that leverages RNA-seq to identify potential therapeutic leads for patients, plans to explore this method for processing its pediatric cohort.


Subject(s)
Algorithms , Bayes Theorem , Biomarkers, Tumor/metabolism , Gene Expression Profiling , Gene Expression Regulation, Neoplastic , Neoplasms/pathology , Biomarkers, Tumor/genetics , Humans , Neoplasms/genetics , Neoplasms/metabolism , Prognosis
13.
PeerJ ; 8: e8356, 2020.
Article in English | MEDLINE | ID: mdl-32025367

ABSTRACT

To date, five ctenophore species' mitochondrial genomes have been sequenced, and each contains open reading frames (ORFs) that if translated have no identifiable orthologs. ORFs with no identifiable orthologs are called unidentified reading frames (URFs). If truly protein-coding, ctenophore mitochondrial URFs represent a little understood path in early-diverging metazoan mitochondrial evolution and metabolism. We sequenced and annotated the mitochondrial genomes of three individuals of the beroid ctenophore Beroe forskalii and found that in addition to sharing the same canonical mitochondrial genes as other ctenophores, the B. forskalii mitochondrial genome contains two URFs. These URFs are conserved among the three individuals but not found in other sequenced species. We developed computational tools called pauvre and cuttlery to determine the likelihood that URFs are protein coding. There is evidence that the two URFs are under negative selection, and a novel Bayesian hypothesis test of trinucleotide frequency shows that the URFs are more similar to known coding genes than noncoding intergenic sequence. Protein structure and function prediction of all ctenophore URFs suggests that they all code for transmembrane transport proteins. These findings, along with the presence of URFs in other sequenced ctenophore mitochondrial genomes, suggest that ctenophores may have uncharacterized transmembrane proteins present in their mitochondria.

14.
Nat Biotechnol ; 36(9): 875-879, 2018 10.
Article in English | MEDLINE | ID: mdl-30125266

ABSTRACT

Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references represent only one version of each locus, ignoring variation in the population. Poor representation of an individual's genome sequence impacts read mapping and introduces bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation across a population, including large-scale structural variation such as inversions and duplications. Previous graph genome software implementations have been limited by scalability or topological constraints. Here we present vg, a toolkit of computational methods for creating, manipulating, and using these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays, with improved accuracy over alignment to a linear reference, and effectively removing reference bias. These capabilities make using variation graphs as references for DNA sequencing practical at a gigabase scale, or at the topological complexity of de novo assemblies.


Subject(s)
Genetic Variation , Computer Simulation , DNA/genetics , Humans
15.
J Comput Biol ; 25(7): 664-676, 2018 07.
Article in English | MEDLINE | ID: mdl-29792514

ABSTRACT

Efforts to incorporate human genetic variation into the reference human genome have converged on the idea of a graph representation of genetic variation within a species, a genome sequence graph. A sequence graph represents a set of individual haploid reference genomes as paths in a single graph. When that set of reference genomes is sufficiently diverse, the sequence graph implicitly contains all frequent human genetic variations, including translocations, inversions, deletions, and insertions. In representing a set of genomes as a sequence graph, one encounters certain challenges. One of the most important is the problem of graph linearization, essential both for efficiency of storage and access, and for natural graph visualization and compatibility with other tools. The goal of graph linearization is to order nodes of the graph in such a way that operations such as access, traversal, and visualization are as efficient and effective as possible. A new algorithm for the linearization of sequence graphs, called the flow procedure (FP), is proposed in this article. Comparative experimental evaluation of the FP against other algorithms shows that it outperforms its rivals in the metrics most relevant to sequence graphs.


Subject(s)
Computational Biology/statistics & numerical data , Genome, Human/genetics , Genomics/methods , Algorithms , Base Sequence/genetics , Chromosome Mapping/statistics & numerical data , Genomics/statistics & numerical data , Humans , Translocation, Genetic/genetics
16.
J Comput Biol ; 25(7): 649-663, 2018 07.
Article in English | MEDLINE | ID: mdl-29461862

ABSTRACT

A superbubble is a type of directed acyclic subgraph with single distinct source and sink vertices. In genome assembly and genetics, the possible paths through a superbubble can be considered to represent the set of possible sequences at a location in a genome. Bidirected and biedged graphs are a generalization of digraphs that are increasingly being used to more fully represent genome assembly and variation problems. In this study, we define snarls and ultrabubbles, generalizations of superbubbles for bidirected and biedged graphs, and give an efficient algorithm for the detection of these more general structures. Key to this algorithm is the cactus graph, which, we show, encodes the nested decomposition of a graph into snarls and ultrabubbles within its structure. We propose and demonstrate empirically that this decomposition on bidirected and biedged graphs solves a fundamental problem by defining genetic sites for any collection of genomic variations, including complex structural variations, without need for any single reference genome coordinate system. Further, the nesting of the decomposition gives a natural way to describe and model variations contained within large variations, a case not currently dealt with by existing formats [e.g., variant call format (VCF)].


Subject(s)
Computational Biology/methods , Genome/genetics , Genomic Structural Variation/genetics , Algorithms , Molecular Sequence Annotation/methods , Reference Standards , Sequence Analysis, DNA , Software
17.
Forensic Sci Int Genet ; 30: 93-105, 2017 09.
Article in English | MEDLINE | ID: mdl-28667863

ABSTRACT

Massively parallel (next-generation) sequencing provides a powerful method to analyze DNA from many different sources, including degraded and trace samples. A common challenge, however, is that many forensic samples are often known or suspected mixtures of DNA from multiple individuals. Haploid lineage markers, such as mitochondrial (mt) DNA, are useful for analysis of mixtures because, unlike nuclear genetic markers, each individual contributes a single sequence to the mixture. Deconvolution of these mixtures into the constituent mitochondrial haplotypes is challenging as typical sequence read lengths are too short to reconstruct the distinct haplotypes completely. We present a powerful computational approach for determining the constituent haplotypes in massively parallel sequencing data from potentially mixed samples. At the heart of our approach is an expectation maximization based algorithm that co-estimates the overall mixture proportions and the source haplogroup for each read individually. This approach, implemented in the software package mixemt, correctly identifies haplogroups from mixed samples across a range of mixture proportions. Furthermore, our method can separate fragments in a mixed sample by the most likely originating contributor and generate reconstructions of the constituent haplotypes based on known patterns of mtDNA diversity.


Subject(s)
DNA, Mitochondrial/genetics , Haplotypes , High-Throughput Nucleotide Sequencing , Phylogeny , Sequence Analysis, DNA , Algorithms , Humans , Racial Groups/genetics
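
Note on entry 17: the expectation maximization idea in the abstract, co-estimating overall mixture proportions and a per-read source assignment, can be sketched with a generic finite-mixture EM. This is not the mixemt implementation. The function name, the input layout (one row of per-haplogroup likelihoods for each read), and the stopping parameters are assumptions made for illustration.

```python
def em_mixture(likelihoods, n_iter=200, tol=1e-9):
    """Generic EM for mixture proportions (hypothetical helper, not mixemt).

    likelihoods[r][h] is an assumed, precomputed probability of read r given
    candidate haplogroup h. Each read is assumed to have a positive likelihood
    under at least one haplogroup. Returns (proportions, responsibilities).
    """
    n_reads = len(likelihoods)
    n_groups = len(likelihoods[0])
    props = [1.0 / n_groups] * n_groups  # start from uniform proportions
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each read came from each haplogroup.
        resp = []
        for row in likelihoods:
            weighted = [p * l for p, l in zip(props, row)]
            total = sum(weighted)
            resp.append([w / total for w in weighted])
        # M-step: proportions are the average responsibilities across reads.
        new_props = [sum(r[h] for r in resp) / n_reads for h in range(n_groups)]
        delta = max(abs(a - b) for a, b in zip(new_props, props))
        props = new_props
        if delta < tol:
            break
    return props, resp


if __name__ == "__main__":
    # Three reads compatible only with haplogroup 0, one only with haplogroup 1.
    example = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    proportions, responsibilities = em_mixture(example)
    print(proportions)
```

Each row of the returned responsibilities gives a per-read assignment (take the largest entry), while the proportions summarize the mixture as a whole; a real tool would derive the likelihoods from read alignments against known haplogroup-defining variants.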
18.
Genome Res ; 27(5): 665-676, 2017 05.
Article in English | MEDLINE | ID: mdl-28360232

ABSTRACT

The human reference genome is part of the foundation of modern human biology and a monumental scientific achievement. However, because it excludes a great deal of common human variation, it introduces a pervasive reference bias into the field of human genomics. To reduce this bias, it makes sense to draw on representative collections of human genomes, brought together into reference cohorts. There are a number of techniques to represent and organize data gleaned from these cohorts, many using ideas implicitly or explicitly borrowed from graph-based models. Here, we survey various projects underway to build and apply these graph-based structures, which we collectively refer to as genome graphs, and discuss the improvements in read mapping, variant calling, and haplotype determination that genome graphs are expected to produce.


Subject(s)
Genome, Human , Genome-Wide Association Study/methods , Genomics/methods , Genome-Wide Association Study/standards , Genomics/standards , Humans , Polymorphism, Genetic
19.
Nat Methods ; 14(4): 411-413, 2017 Apr.
Article in English | MEDLINE | ID: mdl-28218897

ABSTRACT

DNA chemical modifications regulate genomic function. We present a framework for mapping cytosine and adenosine methylation with the Oxford Nanopore Technologies MinION using this nanopore sequencer's ionic current signal. We map three cytosine variants and two adenine variants. The results show that our model is sensitive enough to detect changes in genomic DNA methylation levels as a function of growth phase in Escherichia coli.


Subject(s)
5-Methylcytosine/metabolism , DNA Methylation , High-Throughput Nucleotide Sequencing/methods , Nanopores , 5-Methylcytosine/analysis , Escherichia coli/genetics , Genome, Bacterial , High-Throughput Nucleotide Sequencing/instrumentation , Markov Chains , Models, Genetic